Abstract

Robust advances in interactome analysis demand comprehensive, non-redundant and consistently annotated
datasets. By non-redundant, we mean that the accounting of evidence for every interaction should be
faithful: each independent experimental support is counted exactly once, no more, no less. While many
interactions are shared among public repositories, none of them contains the complete known interactome for
any model organism. In addition, the annotations of the same experimental result by different repositories
often disagree. This raises the question of which annotation to keep when consolidating identical pieces of
evidence. The iRefIndex database, which includes interactions from most popular repositories with a
standardized protein nomenclature, represents a significant advance in all aspects, especially in
comprehensiveness. However, iRefIndex aims to maintain all information/annotation from original sources and
requires users to perform additional processing to fully achieve the aforementioned goals. Another issue has
to do with protein complexes. Some databases represent experimentally observed complexes as interactions with
more than two participants, while others expand them into binary interactions using the spoke or matrix
model. To avoid a buildup of untested interaction information, it is preferable to replace the expanded
protein complexes, whether from the spoke or matrix model, with a flat list of complex members.

To address these issues and to achieve our goals, we have developed ppiTrim, a script that processes
iRefIndex to produce non-redundant, consistently annotated datasets of physical interactions. Our script
proceeds in three stages: mapping all interactants to gene identifiers and removing all undesired raw
interactions, deflating potentially expanded complexes, and reconciling for each interaction the annotation
labels among different source databases. As an illustration, we have processed the three largest organismal
datasets: yeast, human and fruitfly. While ppiTrim can resolve most apparent conflicts between different
labelings, we also discovered some unresolvable disagreements, mostly resulting from different annotation
policies among repositories.

URL: www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/ppiTrim.html

Introduction 

The current decade has witnessed a significant amount of effort towards discovering the networks of protein- 
protein interactions (interactomes) in a number of model organisms. These efforts resulted in hundreds of 
thousands of individual interactions between pairs of proteins being reported (1). Repositories such as the 
BioGRID (2), IntAct (3), MINT (4), DIP (5), BIND (6, 7) and HPRD (8) have been established to store 
and distribute sets of interactions collected from high-throughput scans as well as from curation of individual 
publications. Depending on its goals, each interaction database, maintained by a different team of curators
located around the world, includes and annotates interactions differently. Consequently, while many
interactions of specific interactomes are shared among databases (1, 9), no single database contains the
complete known interactome for any model organism. Constructing a full-coverage protein-protein interaction
network therefore requires retrieving and combining entries from many databases.

This task is facilitated by several initiatives developed by the proteomics community over the years. The
IMEx consortium (10) was formed to facilitate interchange of information between different primary databases
by using a standardized format. The Proteomics Standards Initiative Molecular Interaction (PSI-MI) format (11)
provides a standard way to represent protein interaction information. One of its salient features is
the controlled vocabulary of terms that can be used to describe various facets of a protein-protein interaction,
including source database, interaction detection method, cellular and experimental roles of interacting proteins
and others. The PSI-MI vocabulary is organized as an ontology, a directed acyclic graph (DAG), where nodes
correspond to terms and links to relations between terms. This enables the terms to be related in an efficient
and algorithm-friendly manner.

Consistently annotated datasets are useful for development and assessment of interaction prediction tools (12,
13, 14, 15). Furthermore, such datasets also form the basis of interaction networks, for which numerous
analysis tools have been developed (16, 17). Depending on the biological aims of a tool, different entities
(nodes) and potentially weighted interactions (edges) may be preferred. The chance of conflicting predictions
from different tools can be reduced by starting from a consistently annotated dataset that faithfully represents
all available evidence. Such a dataset ought to be comprehensive but also non-redundant: the same experimental
evidence for an interaction should appear once and only once. To maintain a coherent development of biological
understanding, it is indispensable to keep the reference datasets up-to-date.

We examined several primary interaction databases with the aim of constructing non-redundant (in terms of
evidence), consistently annotated and up-to-date reference datasets of physical interactions for several model
organisms. Unfortunately, the common standard format used by most primary databases still does not allow
direct compilation of full non-redundant interactomes. This mainly results from the fact that different primary
databases may use different identifiers for interacting proteins and different conventions for representing and
annotating each interaction. Combining interaction data from BIND (6, 7) (in two versions called 'BIND' and
'BIND_Translation'), BioGRID (2), CORUM (18), DIP (5), HPRD (8), IntAct (3), MINT (4), MPact (19),
MPPI (20) and OPHID (21), the iRefIndex (22) database represents a significant advance towards a complete
and consistent set of all publicly available protein interactions. Apart from being comprehensive and relatively
up-to-date, the main contribution of iRefIndex is in addressing the problem of protein identifiers by mapping
the sequence of every interactant into a unique identifier that can be used to compare interactants from different
source databases. In a further 'canonicalization' procedure (23), different isoforms of the same protein are
mapped to the same canonical identifier. By adhering to the PSI-MI vocabulary and file format, iRefIndex
provides largely standardized annotations for interactants and interactions. Construction of iRefIndex led to
the development of iRefWeb, a web interface for interactive access to iRefIndex data (23). iRefWeb allows
easy visualization of evidence for interactions associated with user-selected proteins or publications. Recently,
the authors of iRefIndex and iRefWeb published a detailed analysis of agreement between curated interactions
within iRefIndex that are shared between major databases (24).

However, because it aims to maintain all information from original sources, iRefIndex requires users to perform
additional processing to fully achieve the aforementioned goals. In particular, iRefIndex considers redundancy
in terms of (unordered) pairs of interactants rather than in terms of experimental evidence associated with an
interaction. Consequently, some desirable features may not fit well within the scope of iRefIndex. For example,
one may wish to treat interactions arising from enzymatic reactions as directed and to be able to selectively
include or exclude certain types of reactions, such as acetylation. In many cases, the information about
post-translational modifications is available directly from source databases, but is not integrated
into iRefIndex. Another issue that propagates into iRefIndex from source databases has to do with protein
complexes. Some databases represent experimentally observed complexes as interactions with more than two
participants, while others expand them into binary interactions using the spoke or matrix model (1). Turinsky
et al. (24) recently observed that this difference in the representation of complexes is responsible for a
significant number of disagreements between major databases curating the same publication. From our earlier
work (25), we found that such expanded complexes may lead to nodes with very high degree and often introduce
undesirable shortcuts in networks. To treat the information provided by protein complexes fairly and without
exaggeration, it is preferable to replace the expanded interactions, whether from the spoke or matrix model,
with a flat list of complex members. Additionally, we discovered that the mapping of each protein to a canonical
group by iRefIndex would sometimes place protein sequences clearly originating from the same gene (for example,
differing in one or two amino acids) into different canonical groups.

To achieve the goal of constructing non-redundant, consistently annotated and up-to-date reference datasets,
we developed a script, called ppiTrim, that processes iRefIndex and produces a consolidated dataset of physical
protein-protein interactions within a single organism.

Materials and Methods 

Our script, called ppiTrim, is written in the Python programming language. It takes as input a dataset in
iRefIndex PSI-MI TAB 2.6 format, with 54 TAB-delimited columns (36 standard and 18 added by iRefIndex).
After three major processing steps, it outputs a consolidated dataset, in PSI-MI TAB 2.6 format, containing only
the 36 standard columns (Supplementary Table 1). The three processing steps are: (i) mapping all interactants
to NCBI Gene IDs and removing all undesired raw interactions; (ii) deflating potentially expanded complexes;
and (iii) collecting all raw interactions that originate from a single publication and have the same interactants
and compatible experimental detection method annotations into one consolidated interaction. At each step, ppiTrim
downloads the files it requires from the public repositories and writes its intermediate results as temporary files.

Phase I: initial filtering and mapping interactants 

In Phase I, ppiTrim takes the original iRefIndex dataset and classifies each raw interaction (either a binary
interaction corresponding to a single line in the input file or a complex supported by several lines) into one of
four distinct categories: removed (not examined further), biochemical reaction, complex or potentially part of
a complex, and other (direct binary binding interaction). It removes interactions that are marked as genetic,
that originate from publications specified through a command line parameter, or that have interactants from
organisms other than the main species of the input dataset (the allowed species can be explicitly provided;
otherwise, any interaction whose interactants have different Taxonomy IDs is removed). Additionally, ppiTrim
removes all interactions from OPHID and the 'original' BIND. The former is removed because it contains either
computationally predicted interactions or interactions verified from the literature using text mining (i.e.
without human curation). The latter is removed because it processes the same original dataset as
BIND_Translation (7).
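
The removal rules above can be summarized in a short sketch. It is an illustration only, not the ppiTrim source: the RawInteraction record, its field names and the is_removed function are hypothetical simplifications of the 54-column input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawInteraction:
    # Hypothetical, simplified record; the real input has 54 MITAB columns.
    source_db: str      # e.g. "BioGRID", "OPHID", "BIND", "BIND_Translation"
    pubmed_id: str
    is_genetic: bool
    taxids: tuple       # NCBI Taxonomy ID of every interactant


# Sources dropped wholesale, as described in the text.
REMOVED_SOURCES = frozenset({"OPHID", "BIND"})


def is_removed(rec, allowed_taxids=None, excluded_pmids=frozenset()):
    """Return True if the raw interaction falls into the 'removed' category."""
    if rec.is_genetic or rec.pubmed_id in excluded_pmids:
        return True
    if rec.source_db in REMOVED_SOURCES:
        return True
    if allowed_taxids is not None:
        # Explicit species list: every interactant must belong to it.
        return any(t not in allowed_taxids for t in rec.taxids)
    # No list given: reject interactions that mix different Taxonomy IDs.
    return len(set(rec.taxids)) > 1
```

For example, with the yeast Taxonomy IDs 4932 and 559292 passed as allowed_taxids, a BioGRID record linking proteins from those two IDs is kept, while any OPHID record is dropped regardless of species.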

As a first step, the script seeks to map each interactant to an NCBI Entrez Gene (26) identifier. For
most interactants, it uses the mapping already provided by iRefIndex. In the cases where iRefIndex
provides only a Uniprot (27) knowledge base accession, the script attempts to obtain a Gene ID in three different
ways. First, it searches the iRefIndex mappings.txt file (found compressed in
ftp.no.embnet.org/irefindex/data/current/Mappingfiles/) for any additional mappings. This part is optional
because the mappings.txt file is very large even when compressed, and it would not be feasible to download it
automatically each time ppiTrim is run. Second, for all unmapped Uniprot IDs, it retrieves the corresponding
full Uniprot records using the dbfetch tool from EBI (www.ebi.ac.uk/Tools/dbfetch). If a direct
mapping to Gene ID is present within the record as a part of the DR field, it is used. Otherwise, as the third
option, the canonical gene name (field GN) is used to query the NCBI Entrez Gene database for a matching Gene
record using an Eutils interface. If a single unambiguous match is found, the record's Gene ID is used for the
interactant. No mapping is performed if multiple matches are obtained. Every mapped Gene ID is checked against
the list of obsolete Gene IDs, which are no longer considered to have a protein product existing in vivo. The
interactants that cannot be mapped to valid (non-obsolete) Gene IDs are removed along with all raw interactions
they participate in.
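
The fallback order described above can be sketched as follows. This is a minimal illustration under stated assumptions: all lookups are passed in as plain dictionaries (no network access is performed), and the function name, argument names and the "gene_id"/"gene_name" record keys are hypothetical stand-ins for the iRefIndex mapping, the mappings.txt file, the DR and GN fields of a Uniprot record, and an Entrez Gene name query.

```python
def map_to_gene_id(acc, iref_map, extra_map, uniprot_records,
                   gene_name_index, obsolete_ids):
    """Map a Uniprot accession to a Gene ID, or return None if impossible.

    iref_map        -- mapping already provided by iRefIndex
    extra_map       -- optional additional mappings (mappings.txt)
    uniprot_records -- accession -> {"gene_id": ..., "gene_name": ...}
    gene_name_index -- gene name -> list of matching Gene IDs
    obsolete_ids    -- set of Gene IDs marked obsolete
    """
    gene_id = iref_map.get(acc) or extra_map.get(acc)
    if gene_id is None:
        record = uniprot_records.get(acc, {})
        gene_id = record.get("gene_id")          # direct cross-reference
        if gene_id is None:
            matches = gene_name_index.get(record.get("gene_name"), [])
            if len(matches) == 1:                # accept unambiguous hits only
                gene_id = matches[0]
    if gene_id is None or gene_id in obsolete_ids:
        return None                              # interactant will be removed
    return gene_id
```

Passing the lookups as arguments keeps the decision logic separate from the download steps, so the same function can be exercised with small hand-made tables.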

After assigning Gene IDs, the script considers the PSI-MI ontology terms associated with interaction
detection method, interaction type and interactants' biological roles. Using the full PSI-MI ontology file in Open
Biomedical Ontology (OBO) format (28), it replaces any non-standard terms in these fields (labeled MI:0000)
with the corresponding valid PSI-MI ontology terms. The terms marked as obsolete in the PSI-MI OBO file
are exchanged for their recommended replacements (Supplementary Table 2). The only exceptions are the
interaction detection method terms for HPRD 'in vitro' (MI:0492, translated from the MI:0045 label in iRefIndex)
and 'in vivo' (MI:0493) interactions, which are kept throughout the entire processing.

Source interactions annotated with a descendant of the term MI:0415 (enzymatic study) as their detection
method or with a descendant of the term MI:0414 (enzymatic reaction) as their interaction type are classified as
candidate biochemical reactions. This category also includes any interactions (including those with more than
two interactants) where one of the interactants has a biological role of MI:0501 (enzyme) or MI:0502 (enzyme
target). In recent months, the BioGRID database has started to provide additional information about the
post-translational modifications associated with the 'biochemical activity' interactions, such as
phosphorylation, ubiquitination etc. This information is available from the BioGRID datasets in the new TAB2
format but is not yet reflected in the PSI-MI terms for interaction type provided in the PSI-MI 2.5 format or in
iRefIndex. Since the post-translational modifications annotated by the BioGRID can be directly matched to
standard PSI-MI terms (Supplementary Table 3), the script downloads the most recent BioGRID dataset in TAB2
format, extracts this information and assigns appropriate PSI-MI terms for interaction type to the candidate
biochemical reactions from iRefIndex that originate from the BioGRID.

Any source interaction not classified as a candidate biochemical reaction is considered for assignment to
the candidate complex category. This category includes all true complexes (having edge type 'C' in
iRefIndex), interactions having a descendant of MI:0004 (affinity chromatography) as the detection method term
or MI:0403 (colocalization) as the interaction type, as well as the interactions corresponding to the BioGRID's
'Co-purification' category. Interactions with interaction type MI:0407 (direct interaction) are never considered
candidates for complexes. All source interactions not falling into the candidate biochemical reaction or candidate
complex categories are considered ordinary binary physical interactions.
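
The classification rules of the last two paragraphs can be sketched as below. This is an illustration rather than the actual implementation. The parents dictionary is a hand-written stand-in for the is_a links that would be read from the PSI-MI OBO file, and the function names are hypothetical. Two assumptions are made where the text is silent: a term counts as its own descendant, and the MI:0407 exclusion is checked before the edge type.

```python
def is_descendant(term, ancestor, parents):
    """True if `term` equals `ancestor` or reaches it via parent links."""
    stack, seen = [term], set()
    while stack:
        t = stack.pop()
        if t == ancestor:
            return True
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return False


def classify(method, itype, roles, edge_type, parents, copurification=False):
    """Return 'biochemical', 'complex' or 'binary' for a retained interaction.

    roles is the set of biological role terms of the interactants;
    copurification flags records from BioGRID's 'Co-purification' category.
    """
    if (is_descendant(method, "MI:0415", parents)
            or is_descendant(itype, "MI:0414", parents)
            or roles & {"MI:0501", "MI:0502"}):
        return "biochemical"
    if itype == "MI:0407":          # direct interaction: never a complex
        return "binary"
    if (edge_type == "C" or copurification or itype == "MI:0403"
            or is_descendant(method, "MI:0004", parents)):
        return "complex"
    return "binary"
```

Biochemical reactions are tested first because, as stated above, only interactions that are not candidate biochemical reactions are considered for the candidate complex category.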

Phase II: deflating spoke-expanded complexes 

The Phase II script attempts to detect spoke-expanded complexes from 'candidate complex' interactions and
deflate them into interactions with multiple interactants. First, all candidate interactions are grouped
according to their publication (Pubmed ID), source database, detection method and interaction type. Each group of
source interactions is turned into a graph and considered separately for consolidation into one or more
complexes. When a portion of a group of interactions is deflated, we replace these source interactions by a
complex containing all their participants. Each collapsed complex is represented using the bipartite
representation in the output MITAB file (the same as the original complexes from iRefIndex, but using newly
generated complex IDs) and the references to the original source interactions are preserved (Supplementary
Table 1). Two procedures are used for consolidation: pattern detection and template matching (Fig. 1). The
deflation algorithm for each new complex is indicated in the output file through its edge type (Table 1).

The pattern detection procedure is used only for the interactions from the BioGRID. Unlike the interactions
from the DIP, those interactions are inherently directed since one protein is always labeled as bait and the other
as prey (in many cases this labeling is unrelated to the actual experimental roles of the proteins). The pattern
indicating a possible spoke-expanded complex consists of a single bait being linked to many preys. Since all
interactions in the BioGRID's 'Co-purification' and 'Co-fractionation' categories arise from complexes that are
spoke-expanded using an arbitrary protein as a bait (BioGRID Administration Team, private communication),
a bait linked to two or more preys can in that case always be considered an expanded complex and deflated.
Such deflated complexes are assigned the edge type code 'G'. The remainder of the complex candidate
interactions from the BioGRID were obtained by affinity chromatography and are, in most cases, also derived from
complexes. Here we adopted the heuristic that a bait linked to at least three preys can be considered a complex.
Clearly, some experiments involve a single bait being used with many independent preys, in which case this
procedure would generate a false complex. Therefore, complexes generated in this way are assigned a different
edge type code ('A'), and the user can specify particular publications to be excluded from consideration as
well as the maximal size of the complex.
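
A minimal sketch of the bait-to-prey pattern follows. It is illustrative only: the function name and its arguments are hypothetical, and the input is assumed to be the (bait, prey) pairs of a single group, that is, one publication, source database, detection method and interaction type.

```python
from collections import defaultdict


def deflate_by_pattern(bait_prey_pairs, min_preys, max_size=None):
    """Collapse each bait with enough distinct preys into one complex.

    Use min_preys=2 for the 'Co-purification'/'Co-fractionation' groups
    (edge type 'G') and min_preys=3 for the affinity chromatography
    heuristic (edge type 'A'). Returns a list of frozensets of members.
    """
    preys_of = defaultdict(set)
    for bait, prey in bait_prey_pairs:
        if bait != prey:                      # ignore self-interactions
            preys_of[bait].add(prey)

    complexes = []
    for bait, preys in preys_of.items():
        members = frozenset(preys | {bait})
        if len(preys) < min_preys:
            continue
        if max_size is not None and len(members) > max_size:
            continue                          # user-imposed size limit
        complexes.append(members)
    return complexes
```

With the example graph of Fig. 1, taken here to consist of bait A linked to preys B, C, D, E, F and bait D linked to preys A, C, E, F, G, a call with min_preys=3 yields the two complexes ABCDEF and ACDEFG.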

Figure 1: ppiTrim uses two procedures for complex deflation: pattern detection (top) and template matching (bottom). As an example,
assume that a graph ABCDEFG, shown on the left, could be constructed from complex candidate interactions annotated by the BioGRID
from a single publication. The arrows indicate bait to prey relationships, with the interaction A-D being repeated twice, once with A and
once with D as a bait. The pattern detection algorithm (top) would recognize A and D as hubs of potentially spoke-expanded complexes and
thus replace all pairwise interactions on the left with complexes ABCDEF and ACDEFG. Suppose that the complex ACDEF was reported
from the same publication by a different database. Then, the template matching procedure (bottom) would generate the complex ACDEF (with
all other annotation, such as experimental detection method, retained from the original interactions) and remove all original interactions
except D-G and A-B. After performing both procedures, ppiTrim consolidates the results so that the overall result would be replacing the
original interactions by complexes ACDEF, ABCDEF and ACDEFG with edge type codes 'R', 'A' and 'A', respectively. The interactions
A-B and D-G would not be retained since they are contained within the deflated complexes ABCDEF and ACDEFG.

The second procedure is based on matching each group of candidate interactions to the complexes indicated
by other databases (templates), mostly from IntAct, MINT, DIP and BIND. In this case, the script checks for
each protein in the group whether it, together with all its neighbors, is a superset of a template complex. If so,
all the candidate interactions between the proteins within the complex are deflated. The neighborhood graph is
undirected for all source databases except the BioGRID. The new complexes generated in this way are given
the code 'R'. The script also attempts to use complexes generated from the BioGRID's interactions through the
pattern detection procedure as templates, in which case the newly generated complexes have the code 'N'. Any
source interactions that cannot be deflated into complexes are retained for Phase III.
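
The superset test used in template matching can be sketched as follows. This is an illustration under simplifying assumptions, not the ppiTrim code: the function name is hypothetical, templates are given as frozensets of members, and a pair is dropped whenever both of its ends lie inside a matched template.

```python
from collections import defaultdict


def match_templates(pairs, templates, directed=False):
    """Return (matched templates, pairs left undeflated).

    pairs     -- list of (a, b) candidate interactions; for BioGRID,
                 a is the bait and b the prey (use directed=True)
    templates -- iterable of frozensets reported as complexes elsewhere
    """
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        if not directed:
            neighbors[b].add(a)

    matched = set()
    for protein, nbrs in neighbors.items():
        closed = nbrs | {protein}             # protein plus its neighbors
        for template in templates:
            if template <= closed:            # superset of the template
                matched.add(template)

    remaining = [(a, b) for a, b in pairs
                 if not any(a in t and b in t for t in matched)]
    return matched, remaining
```

Applied to the Fig. 1 example (bait A with preys B, C, D, E, F; bait D with preys A, C, E, F, G; template ACDEF; directed=True), the sketch matches ACDEF and leaves only A-B and D-G, in agreement with the figure caption.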

Phase III: Normalizing interaction type annotation 



Overview 

The goal of the final phase of ppiTrim is to consolidate all evidence for an interaction, obtained from a single
experiment, into one consolidated interaction record. Every source publication contains descriptions of one or
more experiments that result in reported interactions. Unfortunately, distinct experiments within each
publication are not annotated in all source databases, with the exception of the interactions from IntAct and MINT
that appear to distinguish experiments using a numbered suffix to the author's name in the 'Author' field. It is
therefore necessary to rely on the experimental detection method terms to determine whether source records
from different databases, with the same interactants and source publication, represent the evidence for the same
interaction. Ideally, all such records with the same detection method can be collapsed into one consolidated
interaction, although this may undercount multiple pieces of evidence from the same publication obtained by
distinct experiments. However, different databases have different annotation policies and do not necessarily use
the same PSI-MI term to annotate a given experimental method. To resolve detection method term disagreements,
we use the PSI-MI ontology structure (Fig. 2). Two compatible terms assigned by different source databases
are considered to represent the same experimental method within a publication. These annotated records are
thus consolidated.

Table 1: Edge type codes used by ppiTrim

Code  Description
X     undirected binary interaction (physical binding)
D     directed binary interaction (biochemical reaction)
B     biochemical reaction without indication of directionality
C     original complex (from iRefIndex)
G     spoke-expanded complex; deflated by pattern matching from BioGRID's 'Co-purification' and
      'Co-fractionation' categories (reliable)
R     potential spoke-expanded complex; deflated by template matching of a 'C'-complex
A     potential spoke-expanded complex (BioGRID only); deflated by pattern detection
N     potential spoke-expanded complex; deflated by template matching of a 'G'- or 'A'-complex

The Phase III algorithm proceeds as follows. All source interactions and complexes (original as well as
deflated in Phase II) are divided into 'clusters'. Interactions that share the same interactants and the source
publication are placed into the same cluster. The order of interactants is significant only for biochemical
reactions, which are treated as directed interactions (only when direction can be ascertained). Each cluster is
processed independently and divided into subclusters based on compatibility of the PSI-MI terms for interaction
detection method. Interactions from each subcluster are collected into a single consolidated interaction, which
is output to the final dataset. The consolidated record preserves references to all original interactions. Each
consolidated interaction is assigned a single PSI-MI term for interaction detection method that most specifically
describes the entire collection of annotation terms within the subcluster. For easier reference, each consolidated
interaction is given a unique ppiTrim ID, which is similar to RIGID from iRefIndex. This is a SHA1 hash of
a dot-separated concatenation of its interactants (Gene IDs), publication(s), detection method, interaction type
and edge type. Every complex uses its ppiTrim ID as its primary ID.
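
An identifier of this kind can be sketched with the Python standard library. The exact field order, the sorting of interactants and the hexadecimal encoding used below are assumptions made for illustration; identifiers produced by this sketch should not be expected to match real ppiTrim IDs, and the function name is hypothetical.

```python
import hashlib


def ppitrim_like_id(gene_ids, pmids, method, itype, edge_type, directed=False):
    """SHA1 over a dot-separated concatenation of the defining fields.

    gene_ids and pmids are lists of strings. Interactants are sorted
    unless the interaction is directed, so that the ID of an undirected
    interaction does not depend on the order in which they are listed.
    """
    ids = list(gene_ids) if directed else sorted(gene_ids)
    fields = ids + sorted(pmids) + [method, itype, edge_type]
    return hashlib.sha1(".".join(fields).encode("utf-8")).hexdigest()
```

Because every field that defines a consolidated interaction enters the hash, two records receive the same identifier only if they agree on interactants, publication(s), detection method, interaction type and edge type.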



Reconciling annotation 

The DAG structure of an ontology naturally induces a partial order between the terms: for two terms u
and v, we say that u refines v (u is smaller than v, u precedes v) if there exists a directed path in the DAG
from u to v. Two PSI-MI terms can be considered compatible if they are comparable, that is, one refines the other.
Every nonempty collection of terms U can be uniquely split into disjoint sets U_i, such that every U_i has a
single maximal element (an element comparable to and not smaller than any other member) and contains all
members of U comparable to its maximal element. Every subcollection U_i is then consistent because there
exists at least one term within it that can describe all its members, while any two members from different
subcollections are incomparable. The finest consistent term of a subcollection U_i is the smallest member of U_i
that is comparable to all its members (it can also be defined as the smallest member of the intersection of the
transitive closures of all the members of U_i). If U_i is a total order, where all members are pairwise
comparable, the finest consistent term is the minimal term. On the other hand, the minimal term need not exist
(Fig. 2), in which case the finest consistent term lies higher in the hierarchy and represents the most specific
annotation that can be assigned to U_i as a whole.

To produce consolidated interactions from a single cluster, each of its members (interactions) is identified
with its PSI-MI term for interaction detection method. For every cluster member, the set of all other members
with compatible annotations ('compatible set') is computed. As a special case, the following detection method
tags are treated as smaller than any other: 'unspecified method' (MI:0686), 'in vivo' and 'in vitro' (the latter
two are from HPRD only). In this way, non-specific annotations are considered compatible with all other,
more specific evidence. Compatible sets are further grouped according to their maximal elements. Within each
group, the union of the compatible sets produces a subcluster. The finest consistent term for each subcluster is
found by considering all PSI-MI terms on the paths from the subcluster members to its maximum; the search
is not restricted to those terms that are within the subcluster (Fig. 2).
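
The partial order and the search for the finest consistent term can be sketched on a toy ontology. This is an illustration, not the ppiTrim implementation. The parents dictionary is hand-written: MI:0401 and MI:0004 are the terms named in Fig. 2, the direct link between them is a simplification, and leaf1 to leaf3 are hypothetical terms. The special handling of 'unspecified method', 'in vivo' and 'in vitro' is omitted, and the choice of the candidate with the most ancestors as the smallest one is an assumption that holds when the candidates form a chain.

```python
def ancestors(term, parents):
    """All terms reachable from `term` through parent links."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def refines(u, v, parents):
    """u refines v: u equals v or a directed path leads from u to v."""
    return u == v or v in ancestors(u, parents)


def comparable(u, v, parents):
    return refines(u, v, parents) or refines(v, u, parents)


def split_consistent(terms, parents):
    """Map each maximal element to the members that refine it."""
    terms = set(terms)
    maximal = [m for m in terms
               if not any(t != m and refines(m, t, parents) for t in terms)]
    return {m: {t for t in terms if refines(t, m, parents)} for m in maximal}


def finest_consistent_term(members, maximum, parents):
    """Smallest term on the member-to-maximum paths comparable to all members."""
    on_paths = {c for t in members for c in ancestors(t, parents) | {t}
                if refines(c, maximum, parents)}
    candidates = [c for c in on_paths
                  if all(comparable(c, t, parents) for t in members)]
    # Deepest candidate first; the term name breaks ties deterministically.
    return max(candidates, key=lambda c: (len(ancestors(c, parents)), c))


# Toy fragment loosely following Fig. 2 (illustrative parent links).
parents = {"MI:0004": ["MI:0401"],
           "leaf1": ["MI:0004"], "leaf2": ["MI:0004"], "leaf3": ["MI:0004"]}
cluster = {"MI:0401", "leaf1", "leaf2", "leaf3"}
groups = split_consistent(cluster, parents)
fct = finest_consistent_term(groups["MI:0401"], "MI:0401", parents)
```

In this toy cluster the single maximal element is MI:0401 and the finest consistent term is MI:0004, which labels no member itself. Dropping MI:0401 from the cluster leaves three mutually incomparable leaves and therefore three subclusters, as in the Fig. 2 caption.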

Conflicts 

We consider two subclusters of the same cluster to be in an unresolvable conflict if there is no source 
database shared between them. This definition takes into account that a source database may report an inter- 
action several times for the same publication, using the same or different interaction detection method. If two 
databases annotate the same interaction using incompatible terms, this is most likely due to an error or spe- 
cific disagreement about the appropriate label, rather than that each database is reporting a different experiment 
from the same publication. Unresolvably conflicting interaction records, after consolidation, point to each other 
using ppiTrim ID in the 'Confidence' field. 

ppiTrim also collects statistics about resolvable conflicts in its temporary output files. A resolvable conflict 
is the case where source interactions within a single subcluster have compatible but different experimental 
detection method labels. 
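
Both kinds of conflict can be expressed compactly. The sketch below is illustrative only: the function names and the data layout are hypothetical, with each subcluster represented as a list of (source database, detection method term) pairs.

```python
from itertools import combinations


def unresolvable_conflicts(subclusters):
    """Pairs of subclusters (by name) that share no source database.

    subclusters -- dict: name -> list of (source_db, method_term) records
    """
    sources = {name: {db for db, _ in recs}
               for name, recs in subclusters.items()}
    return [(a, b) for a, b in combinations(sorted(sources), 2)
            if not sources[a] & sources[b]]


def has_resolvable_conflict(records):
    """True if one subcluster carries more than one distinct method label."""
    return len({term for _, term in records}) > 1
```

In the full procedure, each pair returned by unresolvable_conflicts would correspond to two consolidated records that point to each other through their ppiTrim IDs in the 'Confidence' field.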




Figure 2: The picture shows a part of the PSI-MI ontology graph for interaction detection method associated with a hypothetical cluster
of source interactions involving the same interactants from the same publication. The terms colored blue are associated with the source
interactions within the cluster, while those marked yellow and green are present in the ontology but do not label any source interaction from
the cluster. The entire cluster as shown is consistent, with the term MI:0401 as the maximal element. Its finest consistent term is MI:0004
(colored green) since the cluster members smaller than it are not comparable between themselves. Removing the source interactions
labeled by MI:0401 from the cluster would result in three distinct subclusters. If two subclusters contain no interaction from the same
source database, they would be reported as conflicts.

Table 2: Processing source interactions

Species          Initial  Removed  Without Gene ID  Retained  With Mapped Gene ID
S. cerevisiae     400449   173815             3608    223026                  880
H. sapiens        382094   148724             2738    230632                16187
D. melanogaster   154770    32477             9476    112817                 3427

Statistics of initial processing of raw interactions from iRefIndex. Shown are the initial number, total number removed due to filtering
criteria, number removed due to missing Gene ID, total number of retained and the number retained containing at least one interactant with
mapped Gene ID.



Table 3: Mapping CROGID identifiers from iRefIndex into Gene IDs

                 Initial CROGIDs          Additional Mapped  Final
Species          total   mapped  orphans  total   valid      CROGIDs  Gene IDs
S. cerevisiae     6159     5552      607    433      47         5599      5618
H. sapiens       14047    11432     2615   1261    1261        12693     11786
D. melanogaster   9379     7810     1569    566     566         8346      7846

Statistics of mapping CROGIDs into Gene IDs. Columns 2-4 show the total number of CROGIDs considered, the number that could be
directly mapped to Gene IDs and the number of 'orphans' that are not associated with a Gene ID in the iRefIndex file. Columns 5 and 6
show the numbers of CROGIDs additionally mapped to Gene IDs, while the last two columns show the final number of CROGIDs accepted
and the corresponding number of Gene IDs. It is possible for a CROGID to map to multiple Gene IDs (if multiple genes encode the same
protein sequence) as well as for multiple CROGIDs to map to a single Gene ID (if our additional mapping links them to the same gene).



Evaluation of the script 

To test ppiTrim, we applied it to the yeast (S. cerevisiae), human (H. sapiens) and fruitfly (D. melanogaster)
datasets from iRefIndex release 8.0-beta, dated Jan 19th 2011. The script was run on June 13th 2011 and used
the then-current versions of Uniprot and NCBI Gene databases. We restricted protein interactors to allowed
NCBI Taxonomy IDs: 4932 and 559292 for yeast, 9606 for human, and 7227 for fruitfly datasets. When
processing the yeast dataset, we accounted for two special cases. First, we specifically removed the genetic
interactions reported by Tong et al. (29) because they were not labeled as genetic by all source databases.
Second, we excluded the dataset by Collins et al. (30) from Phase II and retained all its interactions as binary
undirected. This dataset is present only in the BioGRID and can be considered computationally derived and
partially redundant. Collins et al. (30) reprocessed the data from Gavin et al. (31) and Krogan et al. (32) to
obtain an improved set of pairwise interactions. Collins et al. (30) used hierarchical clustering to recover protein
complexes, but these are not present in the BioGRID. In spite of its redundancy, we decided not to entirely
remove this dataset but also not to attempt to deflate its potential complexes because bait/prey assignments may
not be meaningful in this case.

Results and Discussion 

The results of applying ppiTrim to process iRefIndex 8.0 are shown in Tables 2-5. The statistics of ID mapping
(Tables 2 and 3) show that a considerable number of interactants could be additionally mapped to Gene IDs in the
human and fruitfly datasets, thus enabling us to take into consideration a few thousand raw interactions that
would otherwise be filtered out. This is also evident in terms of iRefIndex RIGIDs (Supplementary Table 4), which
associate all raw interactions whose interactants have the same sequences with a single record. For yeast, the
number of interactions gained by mapping to Gene IDs is small because most of the mapped IDs were not valid.

Table 4: Deflating spoke-expanded complexes

                                    Pairs                  Complexes
Species          Publications  initial  remaining      C     G     R     A     N
S. cerevisiae            3924   118819      28643   7729   323  5384  3190  1311
H. sapiens              10317    56111      35650   8382   181  1143  1443   304
D. melanogaster           398     1722       1053    220    16    82    33     3

Shown are the numbers of complexes obtained by deflating binary interactions with affinity chromatography (or related) as experimental
method. Types of complexes are indicated by one letter codes described in Table 1. The counts of pairs shown include those from
publications with fewer than three interactions (per database), which could never be deflated into complexes.

We chose to standardize proteins using NCBI Gene identifiers rather than the iRefIndex-provided canonical
IDs (CROGIDs) for several reasons. NCBI Gene records not only associate each gene with a set of reference 
sequences, but also include a wealth of additional data (e.g. list of synonyms) and links to other databases 
such as Gene Ontology (33) that are important when using the interaction dataset in practice. In addition, Gene 
records are regularly updated and their status evaluated based on new evidence. Thus, a gene record may be 
split into several new records or marked as obsolete if it corresponds to an ORF that is known not to produce 
a protein. For network analysis applications, it is desirable that only the proteins actually expressed in the
cell are represented in the network and hence the gene status provided by NCBI Gene is a valuable filtering 
criterion. Our results in yeast (Table 3) support this premise: most CROGIDs without Gene ID are associated 
with sequences derived from ORFs that were subsequently declassified as genes. However, CROGIDs do have 
one advantage over NCBI Gene IDs in that they are protein-based and hence identical protein products of 
several genes (like histones) are clustered together. 

There are several reasons why our algorithm was able to introduce many additional associations of CROGIDs
to Gene IDs. First, iRefIndex only provides mappings to Gene IDs for interactors that have a sequence that
exactly matches a sequence in an NCBI RefSeq record (Ian Donaldson, private communication). By a case-by-case
examination of some orphaned yeast sequences that could be mapped to Gene IDs, we found that they
were orphans because they differed in one or two amino acids from that protein's reference representative
in RefSeq but were not clustered with that representative's Gene record. Additional mappings can be found
through a database cross-reference from a Uniprot record pointing to a Gene ID. The iRefIndex canonicalization
procedure captures some of these associations in the mappings.txt file but they are not available in
the main iRefIndex MITAB files. We have found (Supplementary Table 5) that some CROGIDs (mostly in
human) can be additionally mapped by using this information in the mappings.txt file. Notably, ppiTrim
accesses a more recent version of Uniprot than iRefIndex and is thus able to find more mappings by accessing
Uniprot cross-references directly. Finally, there is a substantial number of Uniprot records that do not have a
cross-reference to NCBI Gene but can be linked to a Gene record through their canonical gene names. This
last approach can be suggested as an improvement for iRefIndex canonicalization processing.
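
The layered mapping strategy described above can be sketched as a simple cascade of lookups. This is only an illustration under stated assumptions: the function name `map_crogid`, the order in which the sources are tried, the rule that an invalid Gene ID falls through to the next source, and all identifiers in the example tables are hypothetical and are not taken from the ppiTrim source.

```python
def map_crogid(crogid, sources, valid_gene_ids):
    """Try each mapping source in turn and return (source label, Gene IDs).

    `sources` is an ordered list of (label, table) pairs, where each table
    maps a CROGID to a set of candidate Gene IDs. Only candidates present
    in `valid_gene_ids` are accepted. Returns (None, set()) for orphans.
    """
    for label, table in sources:
        gene_ids = table.get(crogid, set()) & valid_gene_ids
        if gene_ids:
            return label, gene_ids
    return None, set()


# Hypothetical lookup tables keyed by CROGID (all identifiers are made up).
sources = [
    ("iRefIndex MITAB", {101: {5001}}),
    ("mappings.txt", {102: {5002}}),
    ("Uniprot cross-reference", {103: {5003}, 104: {9999}}),
    ("Uniprot gene name", {104: {5004}}),
]
valid_gene_ids = {5001, 5002, 5003, 5004}

for crogid in (101, 104, 105):
    print(crogid, map_crogid(crogid, sources, valid_gene_ids))
```

In this toy example, CROGID 104 has a cross-reference to an invalid Gene ID and is therefore resolved through its gene name, while CROGID 105 remains an orphan.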

Around 10% of CROGIDs could not be mapped to Gene IDs even after processing with the ppiTrim algorithms.
A few interactors (Supplementary Table 5) have only PDB accessions as their primary IDs since their
interactions were derived from crystal structures. In such cases, often only partial sequences of participating
proteins are available. These partial sequences cannot be fully matched to any Uniprot or RefSeq record and
hence are assigned a separate ID. An improvement to our procedure, which would account for this case as
well as for those unmapped proteins that differ from canonical sequences by only a few amino acids, would be to
use direct sequence comparison to find the closest valid reference sequence. This task may not be technically
difficult (a similar procedure was applied by Alves et al. (34) to construct protein databases for mass spectrometry
data analysis) but is beyond the scope of ppiTrim, which is intended as a relatively short standalone script.
In our opinion, such additional mappings would best be performed at the level of reference sequence databases
such as Uniprot or RefSeq, which contain curator expertise to resolve ambiguous cases.

Protein complexes obtained through chromatography techniques provide information complementary to
direct binary interactions. While it is often difficult to determine the exact layout of within-complex pairwise
interactions, an identification of an association of several proteins using mass spectrometry is evidence for
the in vivo existence of that association. Unfortunately, in spite of its great importance, the currently available
information within iRefIndex is deficient because of different treatments of complexes by different source
databases. Our results (Table 4) show that the apparently inflated complexity of interaction datasets can be
substantially reduced by attempting to collapse spoke-expanded complexes. For yeast, this results in a reduction
of about three quarters in the number of candidate interactions (from 118819 to 28643 pairs). The majority of
new complexes fall into the 'G' and 'R' categories, which can be considered most reliable. For the human dataset,
the reduction is small as a proportion, although in absolute terms the number of new complexes is over 3000. The
fruitfly dataset did not contain many candidate interactions or complexes and hence not many new complexes were obtained.
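
The proportions quoted above can be checked directly against the figures in Table 4. The short calculation below is a sketch that assumes the G, R, A and N columns count newly deflated complexes, while the C column counts complexes already reported as such by the source databases; this reading is consistent with the statement that the human dataset yields over 3000 new complexes.

```python
# Figures copied from Table 4:
# species -> (initial pairs, remaining pairs, G, R, A, N)
TABLE4 = {
    "S. cerevisiae": (118819, 28643, 323, 5384, 3190, 1311),
    "H. sapiens": (56111, 35650, 181, 1143, 1443, 304),
    "D. melanogaster": (1722, 1053, 16, 82, 33, 3),
}

summary = {}
for species, (initial, remaining, g, r, a, n) in TABLE4.items():
    # Percentage of candidate pairs removed by deflation.
    reduction = 100.0 * (initial - remaining) / initial
    # Assumed definition of "new" complexes: all deflated categories.
    summary[species] = (round(reduction, 1), g + r + a + n)

for species, (reduction, new_complexes) in summary.items():
    print(f"{species}: {reduction}% fewer pairs, {new_complexes} deflated complexes")
```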

In general, it is difficult to assess whether newly generated complexes from the 'A' and 'N' categories are
biologically justified, that is, whether they represent a functional entity. If a bait and its preys genuinely
originate from a single experiment, they definitely form a physical association that may be a part of or an
entire functional complex. Since ppiTrim preserves the experimental role labels and the original interaction
identifiers, little information is lost by deflating such associations into a single record. On the other hand, for
some publications, especially those involving experiments with ubiquitin-like proteins as bait, each bait-prey
association may represent a separate experiment and does not substantiate that different prey proteins are
co-present in the cell. For example, BioGRID provides 158 physical associations from the paper by Hannich
et al. (35), each involving the yeast Smt3p (SUMO, a ubiquitin-like) protein as a bait. In this case, it is not true
that all the involved preys together form a large complex with the bait. ppiTrim avoids this particular case by
not deflating complexes that would be too large (the maximum deflated complex size is tunable by the user, with
a default of 120 proteins), but one can assume that some of the deflated 'complexes' do not exist in vivo.
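
The size cutoff can be illustrated with a simplified deflation sketch. It is not the ppiTrim implementation: the tuple layout, the grouping key (publication, source database, bait) and the requirement of at least two preys are assumptions made for illustration. Only the default maximum size of 120 proteins is taken from the text.

```python
from collections import defaultdict

MAX_COMPLEX_SIZE = 120  # default stated in the text; tunable by the user


def deflate(pairs, max_size=MAX_COMPLEX_SIZE):
    """Collapse spoke-expanded bait-prey pairs into candidate complexes.

    `pairs` is an iterable of (pmid, source_db, bait, prey) tuples.
    Returns (complexes, leftover_pairs).
    """
    groups = defaultdict(set)
    for pmid, db, bait, prey in pairs:
        groups[(pmid, db, bait)].add(prey)

    complexes, leftover = [], []
    for (pmid, db, bait), preys in sorted(groups.items()):
        members = {bait} | preys
        # Deflate only baits with several preys and a plausible complex size.
        if len(preys) >= 2 and len(members) <= max_size:
            complexes.append(
                {"pmid": pmid, "db": db, "bait": bait, "members": sorted(members)}
            )
        else:
            leftover.extend((pmid, db, bait, prey) for prey in sorted(preys))
    return complexes, leftover
```

With this rule, a bait with 158 preys from a single publication, as in the Smt3p example, stays as 158 binary pairs because the resulting complex would have 159 members.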

To more closely investigate the fidelity of generated complexes, we randomly sampled 25 'A' and 'N' deflated
yeast complexes from the final output of ppiTrim and examined their original publications. Out of these
25 complexes, 15 originated from high-throughput publications (mostly Gavin et al. (31) and Krogan et al.
(32); Supplementary Table 6), while 10 came from small experiments (Supplementary Table 7). In all high-throughput
cases, the deflated complex represents a true experimental association. In the cases where authors
present their own derived complexes, which in many cases can be found separately under the 'C' category, our
deflated complexes form parts of larger derived complexes. Indeed, such derived complexes are obtained by
assembling the results of several bait-prey experiments, each of which forms a single deflated complex. The results
are more varied for low-throughput publications. In most cases, deflated complexes clearly correspond to
functional complexes, although it is sometimes difficult to fully relate the authors' conclusions to their reported
results. In two cases, the inferred association is incorrect due to curation errors in the original database. We
have also found a single case where the publication authors directly state that proteins in a deflated complex do
not form a stable complex.

While our sample is extremely small, it does indicate several issues arising from deflation of bait-prey 
relationships. In most cases, deflated complexes form parts of what are believed to be functional complexes. It 
appears that curation errors or ambiguities may be a more significant source of wrongly inferred associations 
than our main assumption that a bait with several preys in a single publication represents a single unit. Overall, 
we feel that the benefits from reduction of interactome complexity outweigh the disadvantages from potentially 
over-deflating interactions. The best way to solve the problem of different representations of protein complexes
would be at the level of source databases (BioGRID in particular), by reexamining the original publications.
Our complexes from the 'R' category, where deflated complexes fully agree with an annotated complex from a 
different database, could serve as a guide in this case. 

Overall, our processing significantly reduced the number of interactions within each of the three datasets
considered (Table 5). This indicates a significant redundancy, particularly for protein complexes, original and
deflated (compare Table 4 with Table 5), and for binary interactions. The directed interactions (biochemical
reactions) are comparatively rare and largely non-redundant at this stage. Given their importance in elucidating
biological function, the directed interactions are expected to be discovered more fully with time. However, one
should note that the PSI-MI format can only represent a static relationship among a set of physical entities involved
in the same event, but cannot actually represent two sides of a reaction, e.g. A + B → C + D. Certain pairs of
PSI-MI biological role terms can be combined to represent interaction direction, e.g. enzyme and enzyme target,
but these are weak compared to the rich ways that pathway databases like Reactome (36) represent events.

Table 5: Final consolidated datasets

                                Input Pairs              Consolidated                    Conflicts
Species          Publications  biochem    other  complexes  directed  undirected  resolvable  unresolvable
S. cerevisiae            6303     5780   119329      10778      5525       63648       19344           454
H. sapiens              22660     2446   199094       6483      2042       85480       26478          1333
D. melanogaster           564       51   111862        227        33       27981       19430            11

For each species, shown are the numbers of input pairs (input complexes are those from Table 4), classified as either biochemical reactions
(potentially directed) or others; also shown are the final numbers of consolidated interactions (classified as complexes, directed or
undirected). The 'other' column accounts only for those interactions that were not deflated into complexes in Phase II. The last two columns
show the total numbers of resolvable and unresolvable conflicts between consolidated interactions. An unresolvable conflict is an instance
where two consolidated interactions, originating from the same publication, are reported using incompatible experimental detection method
labels by different databases. A resolvable conflict is the case where source interactions within a single consolidated interaction have
different (but compatible) experimental detection method labels.

Table 6: Most common interaction detection method PSI-MI term conflicts

Term A                                    Sources A  Term B                                  Sources B  Counts
MI:0007 (anti tag coimmunoprecipitation)  M          MI:0676 (tandem affinity purification)  DI            132
MI:0004 (affinity chromatography)         B          MI:0363 (inferred by author)            I              60
MI:0018 (two hybrid)                      DIMN       MI:0096 (pull down)                     BI             43
MI:0071 (molecular sieving)               DIN        MI:0096 (pull down)                     B              32
MI:0030 (cross-linking study)             DIMN       MI:0096 (pull down)                     B              22

MI:0007 (anti tag coimmunoprecipitation)  IM         MI:0676 (tandem affinity purification)  DI           1227
MI:0018 (two hybrid)                      BDHIM      MI:0096 (pull down)                     BM             17
MI:0096 (pull down)                       B          MI:0107 (surface plasmon resonance)     DM              6
MI:0008 (array technology)                I          MI:0049 (filter binding)                M               5
MI:0019 (coimmunoprecipitation)           IM         MI:0096 (pull down)                     BI              5

The five most common unresolvable interaction detection method PSI-MI term conflicts for the yeast (top) and human (bottom) datasets are
shown. Source databases are indicated by one letter codes: B (BioGRID), D (DIP), I (IntAct), H (HPRD), M (MINT), P (MPPI).

To demonstrate the utility of our conflict resolution method, we present the counts of resolvable and unresolvable
conflicts in Table 5. Resolvable conflicts significantly outnumber the unresolvable ones. Examining
the most common examples of resolvable conflicts (Supplementary Table 8), one can see that a majority of
them indeed represent the same experiment. Possible exceptions are human interactions annotated by HPRD,
which have ambiguous detection method labels. To address this and similar problems, ppiTrim provides the
maxsources confidence score (Supplementary Table 1), which is an estimate of the maximal number of independent
experiments contributing to a consolidated interaction. An interesting example of a resolvable conflict
in Supplementary Table 8 is the 444 instances of a consolidated interaction containing source interactions with
detection method labels MI:0004 (affinity chromatography technology), MI:0007 (anti tag coimmunoprecipitation),
and MI:0676 (tandem affinity purification). This case is very similar to the one described in Figure 2:
the last two terms are incompatible but the first resolves the conflict as the finest consistent term.
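
The idea of a finest consistent term can be sketched on a toy fragment of the detection method hierarchy. The sketch assumes that two terms are compatible when they are equal or one is an ancestor of the other, and that the finest consistent term is the most specific term in the set that is compatible with all the others. The parent links below are simplified (intermediate PSI-MI terms are omitted) and the function names are hypothetical; this is one reading of the procedure, not the ppiTrim code.

```python
# Toy child -> parent links; intermediate PSI-MI terms are omitted.
PARENT = {
    "MI:0007": "MI:0004",  # anti tag coimmunoprecipitation
    "MI:0676": "MI:0004",  # tandem affinity purification
    "MI:0004": "MI:0045",  # affinity chromatography technology
    "MI:0018": "MI:0045",  # two hybrid
}


def ancestors(term):
    """Return the set of all ancestors of `term` in the toy hierarchy."""
    found = set()
    while term in PARENT:
        term = PARENT[term]
        found.add(term)
    return found


def compatible(a, b):
    """Terms are compatible if equal or if one is an ancestor of the other."""
    return a == b or a in ancestors(b) or b in ancestors(a)


def finest_consistent_term(terms):
    """Return the most specific term compatible with all `terms`, or None."""
    terms = list(terms)
    candidates = [t for t in terms if all(compatible(t, u) for u in terms)]
    if not candidates:
        return None
    # Candidates lie on a single path, so depth identifies the finest one.
    return max(candidates, key=lambda t: len(ancestors(t)))


print(finest_consistent_term(["MI:0004", "MI:0007", "MI:0676"]))
```

For the three labels discussed above, only MI:0004 is compatible with both of the others, so it is returned. Without MI:0004 the function returns None, which corresponds to an unresolvable conflict.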

Upon closer examination of the few unresolvable conflicts (Table 6), it can be seen that the most common
conflicts arise as instances of a few specific labeling disagreements between databases. In many cases, such
disagreements arise from using different sub-terms of affinity chromatography (see Fig. 2) and can be resolved
by assigning a more general term consistent with both conflicting terms. In many other cases, the conflicts are
due to BioGRID internally using a more restricted detection method vocabulary than the IMEx databases (DIP,
IntAct and MINT). However, in some rare cases, an unresolvable conflict arises when different databases annotate
different experiments from the same publication. For example, each of DIP, BioGRID and IntAct reports
several raw interactions from the paper by Blaiseau and Thomas (37) (pubmed:9799240), where the yeast Met4p
protein interacts with each of Met28p, Met31p and Met32p in binary interactions. The paper reports several experiments
using different techniques including northern blotting, yeast two hybrid and electrophoretic mobility
shift assays. For the interaction between Met4p and Met28p, BioGRID and IntAct report only the MI:0018 (yeast
two hybrid) method, while DIP reports only MI:0404 (comigration in non denaturing gel electrophoresis), resulting
in an unresolvable conflict. Hence, in this case, each database on its own provides incomplete evidence for
this interaction.
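
The Met4p-Met28p case can be reproduced with a small grouping sketch. It assumes that source records for the same interactor pair and publication are grouped greedily into clusters of mutually compatible detection method terms, and that more than one cluster signals an unresolvable conflict. The greedy grouping, the simplified parent links and the function names are illustrative assumptions, not the actual ppiTrim procedure.

```python
# Toy child -> parent links; the real PSI-MI hierarchy has more levels.
PARENT = {
    "MI:0018": "MI:0045",  # two hybrid
    "MI:0404": "MI:0045",  # comigration in non denaturing gel electrophoresis
}


def lineage(term):
    """Return `term` together with all of its ancestors."""
    chain = {term}
    while term in PARENT:
        term = PARENT[term]
        chain.add(term)
    return chain


def compatible(a, b):
    """Terms are compatible when one lies on the other's path to the root."""
    return a in lineage(b) or b in lineage(a)


def cluster_by_method(records):
    """Greedily group (database, term) records into compatible clusters."""
    clusters = []
    for db, term in records:
        for cluster in clusters:
            if all(compatible(term, other) for _, other in cluster):
                cluster.append((db, term))
                break
        else:
            # No existing cluster accepts this term: start a new one.
            clusters.append([(db, term)])
    return clusters


# Met4p-Met28p annotations from pubmed:9799240, as described in the text.
records = [("BioGRID", "MI:0018"), ("IntAct", "MI:0018"), ("DIP", "MI:0404")]
clusters = cluster_by_method(records)
unresolvable = len(clusters) > 1
print(clusters, unresolvable)
```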

The ppiTrim algorithms work best if accurate and fully populated fields for interaction detection method,
publication and interaction type are available in the input dataset. This requirement is mostly fulfilled. Nevertheless,
we have noticed two minor inconsistencies. The first, which will be fixed in a subsequent release of
iRefIndex (Ian Donaldson, private communication), involves the PSI-MI labels for interaction detection method
for CORUM interactions and complexes. These are missing from iRefIndex although they are present in the
original CORUM source files. The second issue concerns missing or invalid Pubmed IDs for certain interactions.
We found that a number of interactions with missing Pubmed IDs come from MINT. Upon inspection of
the original MINT files, we discovered that in many cases MINT supplies a Digital Object Identifier (DOI) for
a publication as its identifier instead of a Pubmed ID (although the corresponding Pubmed ID can be obtained
from the MINT web interface). To ensure consistency with other source databases within iRefIndex, it would
be desirable to have the Pubmed IDs available for these interactions as well.

In this paper, we have identified the tasks needed for using combined interaction datasets provided by iRefIndex
as a basis for construction of reference networks and developed a script to process them into consistent
consolidated datasets. We see ppiTrim as answering a temporary need for a consolidated database and hope
that most of the issues that required processing will eventually be fixed in upstream databases and distributed
through the IMEx consortium. At this stage we have not addressed the issue of quality of interactions although
such information is available in some databases for some publications (23). Utilizing the quality information
in consolidating datasets demands a universal data-quality measure, which does not yet exist.
Supplementary Table 1: Description of ppiTrim MITAB 2.6 columns

Column  Short Name             Description                                     Example
1       uidA                   Smallest Gene ID of the interactor A*†          entrezgene/locuslink:854547
2       uidB                   Smallest Gene ID of the interactor B*           entrezgene/locuslink:855136
3       altA                   All gene IDs of the interactor A*               entrezgene/locuslink:854547
5       aliasA                 All canonical gene symbols and integer          entrezgene/locuslink:BNR1|icrogid:2105284
                               CROGIDs of interactor A
6       aliasB                 All canonical gene symbols and integer          entrezgene/locuslink:MYO5|[illegible]
                               CROGIDs of interactor B
7       method                 PSI-MI term for interaction detection method    MI:0018(two hybrid)
8       author                 First author name(s) of the publication in      Tong AH
                               which this interaction has been shown‡
9       pmids                  Pubmed ID(s) of the publication in which        pubmed:11743162
                               this interaction has been shown
10      taxA                   NCBI Taxonomy identifier for interactor A       taxid:4932(Saccharomyces cerevisiae)
11      taxB                   NCBI Taxonomy identifier for interactor B       taxid:4932(Saccharomyces cerevisiae)
12      interactionType        PSI-MI term for interaction type                MI:0407(direct interaction)
13      sourcedb               PSI-MI terms for source databases‡              MI:0000(MPACT)|MI:0463(grid)|MI:0465(dip)|MI:0469(intact)
14      interactionIdentifier  A list of interaction identifiers§              ppiTrim:tyuGkS0K23ldh3YnSi6GbczJCFE=|MPACT:8233|dip:DIP-11198E|grid:147506|
                                                                               intact:EBI-601565|intact:EBI-601728|irigid:[illegible]|edgetype:X
15      confidence             A list of ppiTrim confidence scores¶            maxsources:2|dmconsistency:full|conflicts:S3oaiXt5tA4vVrUs01rclTA9krk=
16      expansion              Either 'none' for binary interactions or        none
                               'bipartite' for subunits of complexes
17      biologicalRoleA        PSI-MI term(s) for the biological role of       MI:0499(unspecified role)
                               interactor A‡
18      biologicalRoleB        PSI-MI term(s) for the biological role of       MI:0499(unspecified role)
                               interactor B‡
19      experimentalRoleA      PSI-MI term(s) for the experimental role of     MI:0496(bait)|MI:0498(prey)|MI:0499(unspecified role)
                               interactor A‡
20      experimentalRoleB      PSI-MI term(s) for the experimental role of     MI:0496(bait)|MI:0498(prey)|MI:0499(unspecified role)
                               interactor B‡
21      interactorTypeA        PSI-MI term for the type of interactor A        MI:0326(protein)
                               (either 'protein' or 'protein complex')
22      interactorTypeB        PSI-MI term for the type of interactor B        MI:0326(protein)
                               (always 'protein')
29      hostOrganismTaxid      NCBI Taxonomy identifier for the host organism  taxid:4932(Saccharomyces cerevisiae)
31      creationDate           Date when ppiTrim was run                       2011/05/11
32      updateDate             Date when ppiTrim was run                       2011/05/11
35      checksumInteraction    ppiTrim ID for an interaction                   ppiTrim:tyuGkS0K23ldh3YnSi6GbczJCFE=
36      negative               Always 'false'                                  false

The above table shows short descriptions for the columns of lines output by ppiTrim with examples. The columns that are not used by
ppiTrim (output as '-') are omitted. Lists of items are always separated by the | character (without any intervening spaces). This description
only applies to ppiTrim output; the full PSI-MI 2.6 TAB format description can be found at
http://code.google.com/p/psimi/wiki/PsimiTab26Format. Notes: *An interactor may be associated with several Gene IDs. In that case the
smallest one is written in the uid columns while the entire list is shown in the alt columns. †Interactor A may be used to denote a
protein complex. In that case the uidA is of the form complex:<ppiTrim ID>, while altA and aliasA are left empty. ‡Multiple items are
possible, originating from all source records contributing to the consolidated interaction. §The first ID is always the ppiTrim ID for
the consolidated interaction, followed by the original IDs for all contributing interactions and their integer RIGIDs from iRefIndex.
The final item is the edge type code. ¶maxsources: an estimate of the maximal number of independent experiments contributing to the
consolidated interaction; dmconsistency: consistency of contributing detection method terms. Values are one of invalid (no method terms
present), single (only one method term), min (minimum term found but not maximum), max (maximum term found but not minimum), and full
(both minimum and maximum term present in subcluster); conflicts: ppiTrim IDs of consolidated interactions with detection method terms
in conflict with the current one.
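
As an illustration of the list conventions described in the notes, the following sketch splits a confidence field into its named scores and derives the uid value from an alt field. The helper names `parse_confidence` and `smallest_gene_id` are hypothetical, and the sketch assumes that "smallest" refers to numeric order of the Gene IDs.

```python
def parse_confidence(field):
    """Split a ppiTrim confidence field into a dict of score name -> value."""
    scores = {}
    for item in field.split("|"):
        name, _, value = item.partition(":")
        scores[name] = value
    return scores


def smallest_gene_id(alt_field):
    """Return the entry with the numerically smallest Gene ID from an alt field."""
    ids = [int(item.rsplit(":", 1)[1]) for item in alt_field.split("|")]
    return "entrezgene/locuslink:%d" % min(ids)


# Confidence example taken from the table above.
example = "maxsources:2|dmconsistency:full|conflicts:S3oaiXt5tA4vVrUs01rclTA9krk="
scores = parse_confidence(example)
print(scores["maxsources"], scores["dmconsistency"])
```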
Supplementary Table 2: Remapping of obsolete PSI-MI terms

Original Term                                          Mapped Term                                 Notes
MI:0021  colocalization by fluorescent probes cloning  MI:0428  imaging technique
MI:0022  colocalization by immunostaining              MI:0428  imaging technique                  *
MI:0023  colocalization/visualisation technologies     MI:0428  imaging technique                  *
MI:0025  copurification                                MI:0401  biochemical
MI:0059  gst pull down                                 MI:0096  pull down
MI:0061  his pull down                                 MI:0096  pull down
MI:0079  other biochemical technologies                MI:0401  biochemical
MI:0109  tap tag coimmunoprecipitation                 MI:0676  tandem affinity purification
MI:0045  experimental interaction detection            MI:0492  in vitro                           †
MI:0493  in vivo                                       MI:0493  in vivo                            †
MI:0000  coip coimmunoprecipitation                    MI:0019  coimmunoprecipitation
MI:0000  elisa enzyme-linked immunosorbent assay       MI:0411  enzyme linked immunosorbent assay  ‡

* Interaction type is also adjusted to MI:0403 as recommended in psi-mi.obo; † HPRD terms are treated as a special case, see main
text; ‡ MPPI interactions in the human dataset.



Supplementary Table 3: Mapping PTM labels from BioGRID into PSI-MI terms

Original Term           Mapped Term
Acetylation             MI:0192  acetylation reaction
Deacetylation           MI:0197  deacetylation reaction
Demethylation           MI:0871  demethylation reaction
Dephosphorylation       MI:0203  dephosphorylation reaction
Deubiquitination        MI:0204  deubiquitination reaction
Glucosylation           MI:0559  glycosylation reaction
Methylation             MI:0213  methylation reaction
Nedd(Rub1)ylation       MI:0567  neddylation reaction
No Modification         MI:0414  enzymatic reaction
Phosphorylation         MI:0217  phosphorylation reaction
Prenylation             MI:0211  lipid addition
Proteolytic Processing  MI:0570  protein cleavage
Ribosylation            MI:0557  adp ribosylation reaction
Sumoylation             MI:0566  sumoylation reaction
Ubiquitination          MI:0220  ubiquitination reaction

Supplementary Table 4: Processing source interactions (RIGIDs)

Species          Initial  Without Gene ID  Retained  With Mapped Gene ID
S. cerevisiae     186530             1272     79931                  591
H. sapiens        138570             1917     84860                 7158
D. melanogaster    46925             4988     39200                 2176

Statistics of initial processing of raw interactions in terms of iRefIndex RIGIDs. A RIGID for an interaction is a unique hash derived
from its interactants' sequences (with order not significant). Thus, multiple interactions with the same interactants share the same RIGID.
Shown are the initial number, number removed due to missing Gene ID, total number of retained and the number retained containing at
least one interactant with mapped Gene ID. Compared to Table 2 in the main text, this table does not contain a column showing the number
of removed RIGIDs due to filtering criteria. This is because the ppiTrim filtering routine operates on raw interactions (corresponding to a
single record from a source database) and some RIGIDs would be associated with both accepted and removed raw interactions.
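
The property that interactant order is not significant can be illustrated by sorting the interactor keys before hashing. The sketch below is only an illustration of order independence; it uses SHA-1 and base64 from the Python standard library and is not claimed to reproduce the actual iRefIndex RIGID computation. The function name and the separator are assumptions.

```python
import base64
import hashlib


def order_independent_key(interactor_keys):
    """Return a digest that does not depend on the order of the interactors.

    `interactor_keys` is a list of strings standing in for per-sequence
    identifiers; sorting them first makes the result order-independent.
    """
    joined = "|".join(sorted(interactor_keys))
    digest = hashlib.sha1(joined.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# The same pair in either order yields the same key.
print(order_independent_key(["seqA", "seqB"]) == order_independent_key(["seqB", "seqA"]))
```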
Supplementary Table 5: Mapping CROGID identifiers from iRefIndex into Gene IDs: details

Species              I     V     O     R     P     T     M     G     S     B
S. cerevisiae     5552     -     -   607    95   461     -    26    21   386
H. sapiens       11428    11     -  2615   155  2017    71   754   429     -
D. melanogaster   7780     -    30  1569    18   814     2   124   440     -

Detailed statistics of mapping CROGIDs into Gene IDs. All numbers denote CROGIDs: directly mapped to valid Gene IDs in the iRefIndex
file (I); directly mapped to Gene IDs but the Gene IDs were updated during validation (V); directly mapped to obsolete Gene IDs (O); not
directly mapped to Gene IDs - total orphans (R); orphans with PDB accession as a primary ID (P); orphans with Uniprot accession as a
primary ID (T); additionally mapped to a valid Gene ID using the mappings.txt file from iRefIndex (M); additionally mapped to a valid Gene
ID using a direct reference from a Uniprot record (G); additionally mapped to a valid Gene ID using a gene name from a Uniprot record (S);
additionally mapped to a Gene ID that was not valid (B).
Supplementary Table 6: Randomly sampled deflated complexes from high throughput publications

ppiTrim Complex ID: 8AVRUHG76vkiFn2cZGICNZzr00Y=
Sources: grid
Pubmed ID: 14759368
Members: CFT2, YSH1, PTA1, MPE1
Comments: Part of mRNA cleavage/polyadenylation complex (4/10 proteins).

ppiTrim Complex ID: [illegible]
Sources: [illegible]
Pubmed ID: [illegible]
Members: NUT1, MED7, MED4, [remaining members illegible]
Comments: Part of mediator complex.

ppiTrim Complex ID: JU+EOkq6ipLh9DJKRtGRLUvT7vM=
Sources: grid,mint
Pubmed ID: 14759368
Members: UBP6, RPT3, RPN9, RPT1, RPN8, RPN2, RPN7, RPN1
Comments: Part of proteasome. MINT does not contain complexes from the original paper.

ppiTrim Complex ID: HtTmhGlPyfIT2vFtRZ94uWw0rsY=
Sources: grid
Pubmed ID: 16429126
Members: IOC3, HTB1, HTA2, HHF2, ISW1, KAP114, ITC1, RPS4A, VPS1, NAP1, RPO31, ISW2, TBF1, BRO1, [remaining members illegible]
Comments: Part of complex # 99.

ppiTrim Complex ID: LnNzfyPGShcG7zkKynU6+fsK2eU=
Sources: grid
Pubmed ID: 16429126
Members: PSK1, NTH1, BMH2, RTG2, BMH1
Comments: Part of complex # 147 (two core proteins plus three attachments).

ppiTrim Complex ID: S2I6VRjFMWC6rkkM+oYXwKCg9YQ=
Sources: grid
Pubmed ID: 16429126
Members: RPL4B, MNN10, MNN11, HOC1, MNN9, ANP1
Comments: Core complex (# 111 - mannan polymerase II) + one attachment protein (RPL4B).

ppiTrim Complex ID: lfRmAapl2ruoQq202YUJg55maFo=
Sources: grid,mint
Pubmed ID: 16554755
Members: RSM24, RSM28, MRPS5, MRP13, MRPS35, RSM27, RSM7, RSM25, MRPS17, MRPS12, RSM19, MRP4
Comments: Part of complex # 1.

ppiTrim Complex ID: 5tBkYOmK/Glh3vaQmlOnUoBHHMQ=
Sources: grid,mint
Pubmed ID: 16554755
Members: CFT2, YSH1, MPE1, PAP1
Comments: Part of complex [number illegible].

ppiTrim Complex ID: 9f2DVj2rDGeCP53LHOnWRMwql4A=
Sources: grid,mint
Pubmed ID: 16554755
Members: KAP95, RTT103, VMA2, RAI1, RAT1, RPB2, SRP1
Comments: True experimental association but not part of any derived complex.

ppiTrim Complex ID: AVawv51+5Fqe3DquygD/XfyrXxE=
Sources: grid,mint
Pubmed ID: 16554755
Members: RRP42, RRP45, RRP6, CSL4, MPP6, RRP4, LRP1, DDI1
Comments: Part of complex # 19.

ppiTrim Complex ID: NOLEwovavMsFrQEdkSUt/mldeMc=
Sources: grid,mint
Pubmed ID: 16554755
Members: CDC3, SHS1, CDC11, CDC12
Comments: Part of complex # 121.

ppiTrim Complex ID: WA51i87LjlwGp/EeF10V/YvbWlY=
Sources: grid,mint
Pubmed ID: 16554755
Members: GTT2, TRX1, CRN1, SSA3, IPP1, CMD1, TRX2, TDH1, RPL40B, CDC21, OYE2
Comments: True experimental association but not part of any derived complex.

ppiTrim Complex ID: YN/hQXQvzoB5HqrgPzVth28mGsY=
Sources: grid,mint
Pubmed ID: 16554755
Members: RRP43, RRP42, RRP45, RRP40, DIS3, RRP6, RRP4, LRP1
Comments: Part of complex # 19.

ppiTrim Complex ID: ILRk+AgISHpGOSAgkhDzNJWSvt1=
Sources: grid
Pubmed ID: 20489023
Members: RTG3, RTG2, TOR1, TOR2, CKA2, MYO2, MKS1, KOG1
Comments: True experimental association.

ppiTrim Complex ID: xWzvxeJFGqjkCihjmQVf5gZhJjQ=
Sources: dip,grid,mint
Pubmed ID: 20489023
Members: PUF3, SAM1, GCD6, SPT16, MTC1, YGK3, LSM12
Comments: True experimental association.

To partially investigate the fidelity of deflated complexes of type A and N, we randomly sampled 25 such complexes from the final
ppiTrim yeast dataset and examined the original publications associated with them. This table contains 15 deflated complexes from
high-throughput publications, while Supplementary Table 7 contains the complexes from low-throughput publications. Most of the
high-throughput papers referred to in this table present both the lists of bait-prey associations and of derived complexes. The
complexes deflated by ppiTrim are often derived from the former and form only parts of the latter. In the last column of this table,
the complex numbers referred to are labels used by the publication's authors.
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Supplementary Table 7: Randomly sampled deflated complexes from low-throughput publications 



| ppiTrim Complex ID | Sources | Pubmed ID | Members | Comments |
|---|---|---|---|---|
| 15VfQtoe5gxGNwPSY3AG0sq6A2U= | grid | 9891041 | CCR4, HPR1, PAF1, SRB5, GAL11 | NOT a true complex. This is because of bad annotation of PAF1-SRB5 interaction by the BioGRID. Completely opposite interpretation was given in the paper. |
| d79IdtwfTAENrH8CQ+c8CpS389Y= | grid | 10329679 | YPT1, VPS21, YPT7, qpii | True complex. This is the only experiment in the paper. |
| EtS4cgphEpTqJb/FS5qxyzf0ke8= | grid | 11733989 | CDC39, CCR4, CDC36, CAF130, CAF40, CAF120, POP2, NOT5, MOT2 | True complex. CAF120 is an unusual member that could almost be left out. |
| 2kOYGdwzWywSpN5mhK25gCcC6LQ= | grid | 14769921 | GBP2, IMD3, TEF1, KEM1, CTK2, CTK1, | True complex, except that TEF1 should be TEF2. This is an error in the iRefIndex source file; the BioGRID website has the correct as- |
| Kd07BBUF07Sqy9NP3D01ixsS/TY= | grid | 15303280 | BUD31, RPL2B, PRP19, CDC13, ATP1, RPS4A, SNU114, MDH1, MAM33, MRPL3, MRPL17, PRP8, PRP22, PAB1, BRR2 | True association. |
| ZAGz/IZqkEr3/NTDLzPEDAD9cKo= | grid | 16179952 | CDC40, UFD1, SSM4, UBX2 | NOT a true complex, probably due to a typo in annotation. CDC40 cannot be found anywhere in the paper and should most likely be CDC48. |
| RDuOdsPANOQEadfSU5sv05Ifihw= | grid | 16286007 | SIN3, RCO1, RPD3, UME1, EAF3 | True complex. |
| | | | VPS36, VPS25, VPS28, SNF8 | Vps28 binds the other three, which form a complex. |
| lmdypAN9kaHBdasLWS19x8K7KkE= | grid | 20159987 | UBI4, UFD2, PEX29, SSM4 | Biological association but indicated as 'NOT a stable complex' in the paper. |
| aakRh6qVahGxGvqHe399+faxPvA= | grid | 20655618 | PEX13, PEX10, PEX8, PEX12 | Association is correct, although mutant strain was used to obtain this particular complex. |



To partially investigate the fidelity of deflated complexes of type A and N, we randomly sampled 25 such complexes from the final ppiTrim 
yeast dataset and examined the original publications associated with them. This table contains 10 deflated complexes from low-throughput 
publications, while Supplementary Table 6 contains the complexes from high-throughput publications. 
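
The sampling step described in this note could be reproduced along the following lines. This is only a sketch under stated assumptions: the function name `sample_complexes`, the `(complex_id, edge_type)` tuple layout and the fixed seed are hypothetical and are not taken from ppiTrim; only the type codes A and N and the sample size of 25 come from the text.

```python
import random


def sample_complexes(complexes, k=25, types=("A", "N"), seed=0):
    """Draw k complexes of the given types uniformly, without replacement.

    `complexes` is assumed to be an iterable of (complex_id, edge_type)
    pairs; this layout is hypothetical, not ppiTrim's actual data model.
    """
    # Keep only complexes whose type code is one of the requested ones.
    eligible = [c for c in complexes if c[1] in types]
    if len(eligible) < k:
        raise ValueError("fewer eligible complexes than requested sample size")
    # A private, seeded generator makes the sample reproducible.
    rng = random.Random(seed)
    return rng.sample(eligible, k)
```

Fixing the seed is a design choice made here so that the same 25 complexes can be regenerated when the sampled publications are re-examined; the source does not say how the original sample was drawn.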
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Supplementary Table 8: Summary of resolvable conflicts 



| Consolidated terms | Count |
|---|---|
| MI:0018 (two hybrid), MI:0045 (experimental interaction detection), MI:0398 (two hybrid pooling approach), MI:0399 (two hybrid fragment pooling approach) | 3959 |
| MI:0090 (protein complementation assay), MI:0111 (dihydrofolate reductase reconstruction) | 2612 |
| MI:0090 (protein complementation assay), MI:0112 (ubiquitin reconstruction) | 1840 |
| MI:0004 (affinity chromatography technology), MI:0676 (tandem affinity purification) | 1408 |
| MI:0004 (affinity chromatography technology), MI:0007 (anti tag coimmunoprecipitation) | 1231 |
| MI:0018 (two hybrid), MI:0045 (experimental interaction detection) | 954 |
| | 914 |
| MI:0045 (experimental interaction detection), MI:0686 (unspecified method) | 628 |
| | 598 |
| MI:0018 (two hybrid), MI:0398 (two hybrid pooling approach) | 506 |
| MI:0004 (affinity chromatography technology), MI:0007 (anti tag coimmunoprecipitation), MI:0676 (tandem affinity purification) | 444 |
| MI:0018 (two hybrid), MI:0045 (experimental interaction detection), MI:0686 (unspecified method) | 320 |
| | 217 |
| MI:0415 (enzymatic study), MI:0424 (protein kinase assay) | 192 |
| MI:0045 (experimental interaction detection), MI:0081 (peptide array) | 150 |
| MI:0045 (experimental interaction detection), MI:0676 (tandem affinity purification) | 120 |

| Consolidated terms | Count |
|---|---|
| MI:0018 (two hybrid), MI:0398 (two hybrid pooling approach) | 5394 |
| | 2796 |
| MI:0096 (pull down), MI:0492 (in vitro), MI:0493 (in vivo) | 2760 |
| | 2134 |
| MI:0018 (two hybrid), MI:0492 (in vitro) | 1658 |
| MI:0018 (two hybrid), MI:0397 (two hybrid array) | 1045 |
| MI:0096 (pull down), MI:0493 (in vivo) | 513 |
| MI:0004 (affinity chromatography technology), MI:0006 (anti bait coimmunoprecipitation) | 384 |
| MI:0004 (affinity chromatography technology), MI:0019 (coimmunoprecipitation) | 309 |
| MI:0004 (affinity chromatography technology), MI:0007 (anti tag coimmunoprecipitation) | 195 |
| MI:0114 (x-ray crystallography), MI:0492 (in vitro) | 166 |
| MI:0004 (affinity chromatography technology), MI:0096 (pull down) | 161 |
| MI:0047 (far western blotting), MI:0492 (in vitro), MI:0493 (in vivo) | 106 |

| Consolidated terms | Count |
|---|---|
| MI:0018 (two hybrid), MI:0398 (two hybrid pooling approach) | 17738 |
| MI:0018 (two hybrid), MI:0399 (two hybrid fragment pooling approach) | |



All resolvable conflicts with counts of more than 100 for yeast (top), human (middle) and fruitfly (bottom) datasets are shown. 
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